The purpose of our analysis is to explore potential methods of increasing ridership for the DC Capital Bikeshare program. This focus on ridership aligns with DC Bikeshare's main mission of transforming the Washington Metropolitan Area and its communities by providing an affordable bicycle transit system. The goals of this effort include decreasing traffic congestion, promoting health and wellness, decreasing air pollution, and expanding transportation options. The DC Bikeshare program is owned by several jurisdictions; however, it is mainly managed by the DC local government.
More information pertaining to the bikeshare program can be found by following this link: https://capitalbikeshare.com/
Following the methods outlined in the CRISP-DM methodology, we first focused on understanding the business aspects of our problem before analyzing any data. We determined our business objectives by researching the background of the DC Bikeshare program, as described above. Next, we determined our data mining goals and identified an overall question for our analysis to answer.
The question is: How do we make it safer for people to ride bikes as a means of increasing ridership?
Baseline packages for our analysis:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # because we will probably see some dates
library(here) # more easily access files in your project
## Warning: package 'here' was built under R version 4.3.2
## here() starts at U:/DS241/Final-Bikeshare-Project_DS241_Team-4
library(mapview)
## Warning: package 'mapview' was built under R version 4.3.2
library(gbfs)
## Warning: package 'gbfs' was built under R version 4.3.2
library(sf) # working with simple features - geospatial
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(tmap)
## Warning: package 'tmap' was built under R version 4.3.2
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.3.2
library(stringr)
library(raster)
## Warning: package 'raster' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.2
##
## Attaching package: 'raster'
##
## The following object is masked from 'package:dplyr':
##
## select
library(readr)
library(maps)
## Warning: package 'maps' was built under R version 4.3.2
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
library(spData)
## Warning: package 'spData' was built under R version 4.3.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(viridis)
## Warning: package 'viridis' was built under R version 4.3.2
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
##
## The following object is masked from 'package:maps':
##
## unemp
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.2
library(ggbeeswarm)
## Warning: package 'ggbeeswarm' was built under R version 4.3.2
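We initially look at the following data sets: Capital Bikeshare trip data for September 2023, DC bicycle lanes (as both GeoJSON and CSV), crashes in DC, and the DC Health Planning Neighborhoods boundaries: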
bikes_september <- read_csv(here("data_raw","202309-capitalbikeshare-tripdata.csv")) %>% clean_names()
## Rows: 450090 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bikes_lanes = st_read(here("data_raw", "Bicycle_Lanes.geojson")) %>% clean_names()
## Reading layer `Bicycle_Lanes' from data source
## `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\Bicycle_Lanes.geojson'
## using driver `GeoJSON'
## Simple feature collection with 2275 features and 26 fields
## Geometry type: LINESTRING
## Dimension: XYZ
## Bounding box: xmin: -77.08773 ymin: 38.82359 xmax: -76.93066 ymax: 38.98276
## z_range: zmin: 0 zmax: 0
## Geodetic CRS: WGS 84
csv_bikes_lanes = read_csv(here("data_raw", "Bicycle_Lanes.csv")) %>% clean_names()
## Rows: 2275 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): ROUTENAME, SUBBLOCKKEY, BIKELANE_PARKINGLANE_ADJACENT, BIKELANE_TH...
## dbl (8): ROUTEID, ROADTYPE, QUADRANT, TOTALBIKELANES, TOTALBIKELANEWIDTH, W...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csv_bikes_crashes = read_csv(here("data_raw", "Crashes_in_DC.csv")) %>% clean_names()
## Rows: 67621 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): REPORTDATE, ROUTEID, FROMDATE, ADDRESS, WARD, EVENTID, MAR_ADDRESS...
## dbl (45): X, Y, OBJECTID, CRIMEID, CCN, MEASURE, OFFSET, STREETSEGID, ROADWA...
## lgl (1): TODATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dc_shape = st_read(here("data_raw","DC_Health_Planning_Neighborhoods.geojson")) %>% clean_names()
## Reading layer `DC_Health_Planning_Neighborhoods' from data source
## `U:\DS241\Final-Bikeshare-Project_DS241_Team-4\data_raw\DC_Health_Planning_Neighborhoods.geojson'
## using driver `GeoJSON'
## Simple feature collection with 51 features and 8 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99556
## Geodetic CRS: WGS 84
class(bikes_lanes)
## [1] "sf" "data.frame"
class(dc_shape)
## [1] "sf" "data.frame"
class(csv_bikes_crashes)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
class(csv_bikes_lanes)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
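The read messages above report WGS 84 as the CRS for both spatial layers. Before layering spatial objects on one map, a check along the following lines could confirm that they share a CRS. This is a minimal sketch on made-up geometries: `toy_polygon` and `toy_line` are hypothetical stand-ins, not the project's data.

```r
library(sf)

# Hypothetical stand-in for a polygon layer (coordinates invented for illustration).
toy_ring <- rbind(
  c(-77.1, 38.8), c(-76.9, 38.8), c(-76.9, 39.0),
  c(-77.1, 39.0), c(-77.1, 38.8)   # first point repeated to close the ring
)
toy_polygon <- st_sf(
  name = "toy_area",
  geometry = st_sfc(st_polygon(list(toy_ring)), crs = 4326)
)

# Hypothetical stand-in for a line layer.
toy_line <- st_sf(
  name = "toy_lane",
  geometry = st_sfc(st_linestring(rbind(c(-77.05, 38.90), c(-77.00, 38.92))), crs = 4326)
)

# Compare the two coordinate reference systems.
same_crs <- st_crs(toy_polygon) == st_crs(toy_line)
same_crs

# If they differed, one layer could be re-projected to match the other.
toy_line_aligned <- st_transform(toy_line, st_crs(toy_polygon))
```

The same comparison could be applied to `bikes_lanes` and `dc_shape` in place of the toy objects.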
mapview(dc_shape, lwd=3, alpha=0.2) +
mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE)
Filtering the crash data. First, we identified the variables of interest in the full data set. We then filtered the data to include only accidents involving bicycles, requiring that at least one bicycle be involved in the accident. Columns not relevant to our analysis were removed from the data set.
df1_bike_crashes = csv_bikes_crashes %>%
  filter(total_bicycles >= 1) %>% # numeric comparison: at least one bicycle involved
  # dplyr::select is called explicitly because raster masks select()
  dplyr::select(-c(
    locationerror, objectid, crimeid, ccn, reportdate, routeid, streetsegid,
    roadwaysegid, todate, eventid,
    majorinjuries_driver, majorinjuries_pedestrian, majorinjuriespassenger,
    minorinjuries_driver, minorinjuries_pedestrian, minorinjuriespassenger,
    fatal_driver, fatal_pedestrian, fatalpassenger,
    unknowninjuries_driver, unknowninjuries_bicyclist,
    unknowninjuries_pedestrian, unknowninjuriespassenger,
    mar_address, mar_score, mar_id,
    total_vehicles, total_pedestrians, total_taxis, total_government,
    pedestriansimpaired, driversimpaired, speeding_involved, lastupdatedate
  ))
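The filter-and-drop step can be sketched on a small made-up table. The tibble `toy_crashes` and all of its values are hypothetical; only the column names `objectid`, `ccn`, `total_bicycles`, and `ward` are taken from the crash data. The threshold is written as the number `1` so that the comparison stays numeric rather than being coerced to character.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only.
toy_crashes <- tibble::tibble(
  objectid       = 1:4,
  ccn            = c(101, 102, 103, 104),
  total_bicycles = c(0, 1, 2, 0),
  ward           = c("Ward 1", "Ward 2", "Ward 2", "Ward 6")
)

toy_bike_crashes <- toy_crashes %>%
  filter(total_bicycles >= 1) %>%    # keep crashes with at least one bicycle
  dplyr::select(-c(objectid, ccn))   # drop identifier columns not needed later

toy_bike_crashes
```

Qualifying `select()` with `dplyr::` avoids ambiguity when another attached package, such as raster, exports a function with the same name.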
sf_bike_crashes <- df1_bike_crashes %>%
st_as_sf(coords = c("longitude","latitude"), crs = 4326) %>%
st_cast("POINT")
class(sf_bike_crashes)
## [1] "sf" "tbl_df" "tbl" "data.frame"
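The conversion from a plain table to point geometries can be illustrated on a few made-up coordinates. In this sketch `toy_points` is hypothetical, and the row with a missing coordinate is an assumption added to show one way of handling it; nothing in the output above indicates whether the crash data contains such rows.

```r
library(sf)

# Hypothetical coordinates near downtown DC; invented for illustration.
toy_points <- data.frame(
  id        = 1:3,
  longitude = c(-77.0365, -77.0091, NA),
  latitude  = c(38.8977, 38.8899, 38.9000)
)

# By default st_as_sf() stops on missing coordinates, so drop incomplete rows first.
has_coords   <- complete.cases(toy_points[, c("longitude", "latitude")])
toy_complete <- toy_points[has_coords, ]

# coords is given in (x, y) order: longitude first, then latitude.
toy_sf <- st_as_sf(toy_complete, coords = c("longitude", "latitude"), crs = 4326)

st_geometry_type(toy_sf)   # geometry type of each feature
st_crs(toy_sf)$epsg        # EPSG code of the assigned CRS
```

Because `st_as_sf()` with `coords` already yields POINT geometries, a following `st_cast("POINT")` most likely leaves the result unchanged.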
mapview(dc_shape, lwd=3) +
mapview(bikes_lanes, zcol = "streetname", lwd = 3, layer.name = "bike_lanes", legend = FALSE) +
mapview(sf_bike_crashes, color = "red", lwd = 0.5, layer.name = "bike_crashes", legend = FALSE)